1  Abstract


In the ninth quarter of the work effort, we focused on a) conducting experiments on real-world
data sets using the developed algorithms, b) continuing the design and implementation of the
Multiscale Heat-Kernel Coordinates (MHKC) algorithms, and c) packaging the software for release
as open source. This report documents the designs of the MHKC algorithms.

The project is currently on track. In the upcoming quarter, we will continue applying the
developed algorithms to various data sets and continue the design and implementation of the
multiscale heat kernel coordinates algorithms. No problems are currently anticipated.


Use  or  disclosure  of  data  contained  on  this  sheet  is  subject  to  restrictions  on  the  title  page  of  this  report. 




Applied Communication Sciences


ISRN  TELCORDIA-2013-10+PR-0GARAU 
Technical  Progress  Report 
Table  of  Contents 

1  ABSTRACT
2  SUMMARY
3  INTRODUCTION
4  METHODS, ASSUMPTIONS AND PROCEDURES
4.1  Multiscale Heat Kernel Coordinates
4.1.1  Representation Using Canonical Clustering
4.1.2  Representation Using Multiscale SVD
4.2  Deliverables / Milestones
5  RESULTS AND DISCUSSION
6  CONCLUSIONS
7  REFERENCES



2  Summary 


In this quarter, we continued the design and implementation of the new multiscale heat kernel
coordinates (MHKC) algorithms. The current design variants of the MHKC algorithms are
documented in this report.

The project is currently on track. In the upcoming quarters, we will continue applying the
developed algorithms to various data sets and focus on the design and development of the
MHKC algorithms. No problems are currently anticipated.
3  Introduction


The primary project effort over the last quarter focused on completing the design and development
of the multiscale heat-kernel coordinates (MHKC) algorithms [1]. These algorithms provide a
powerful tool for discovering the non-linear geometries in a given dataset. They build on the fast
randomized Singular Value Decomposition (RSVD) algorithms described in the earlier ONR reports
[7] [8]. Use of the RSVD effectively reduces the computational complexity from $O(m \cdot n \cdot k)$
to $O((m+n) \cdot k)$ for an m by n matrix of rank k. In contrast to the multiscale Singular Value
Decomposition (MSVD) algorithms, which detect linear structures in data at multiple scales, the
MHKC uses heat kernels to discover the non-linear manifold structure on which the data resides at
various scales. Like the MSVD, the MHKC provides an efficient representation using
low-dimensional coordinates corresponding to the original data points.
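
The RSVD algorithms themselves are described in [7] [8] and are not reproduced in this report. As a rough illustration of the general idea only (a random projection followed by a small exact SVD), the following Python sketch may help. The function name `randomized_svd` and the parameters `oversample` and `seed` are hypothetical, and this basic variant is not claimed to match the project's implementation or the complexity stated above.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD via a random range finder (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random test matrix; a few extra columns improve the range estimate.
    Omega = rng.normal(size=(n, min(n, k + oversample)))
    # Orthonormal basis Q for the sampled range of A.
    Q, _ = np.linalg.qr(A @ Omega)
    # Exact SVD of the small projected matrix B = Q^T A.
    B = Q.T @ A
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], S[:k], Vt[:k]

# Toy check on an exactly rank-3 matrix (synthetic data).
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30))
U, S, Vt = randomized_svd(A, k=3)
```

On an exactly low-rank matrix such as this toy input, the product `(U * S) @ Vt` should reproduce `A` up to rounding error.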

An outline of the MHKC algorithm was presented in the previous quarterly report [10]. While
most of the algorithm is automated, the crucial step of selecting the appropriate heat-kernel
coordinates for a given application required manual intervention on the part of the data analyst.
In this report, we present two canonical approaches to automating the selection of the MHKC
embedding. These approaches also provide a way to visualize the embedded data in lower (2 or 3)
dimensions.
4  Methods,  Assumptions  and  Procedures


4.1  Multiscale  Heat  Kernel  Coordinates 

The Multiscale Heat Kernel Coordinates (MHKC) algorithms are based on theoretical results
presented in [1]. The current algorithm design is described below.

Input: A set of n data points $\{x_1, x_2, \ldots, x_n\}$ in $\mathbb{R}^d$. Assume n is large.

Step 1 (Normalization): Normalize the points $x_i$ such that the data cloud is in a ball of unit
variance. Define

$$\sigma_x^2 = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. The normalized point $y_i$ corresponding to $x_i$ is given by

$$y_i = \frac{x_i - \bar{x}}{\sigma_x}$$

Note: The translation to mean zero is not necessary for the purposes of building the transition
probability matrix in the next step.
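
As a minimal sketch of the normalization step, assuming the data points are stored as the rows of a NumPy array, the computation could look as follows. The function name `normalize_points` is hypothetical and is used only for illustration.

```python
import numpy as np

def normalize_points(X):
    """Center the rows of X and scale them to unit variance (sketch of Step 1)."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)                               # centroid of the data cloud
    sigma2 = np.mean(np.sum((X - x_bar) ** 2, axis=1))   # (1/n) * sum ||x_i - x_bar||^2
    return (X - x_bar) / np.sqrt(sigma2)

# Toy input: the four corners of a square of side 2.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
Y = normalize_points(X)
```

After normalization, the mean squared distance of the points in `Y` from their centroid is 1.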

Step 2 (Transition Probability Matrix): The second step comprises constructing the data
matrix to be provided as input to the RSVD algorithm. Define the heat kernel as

$$k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{t_0}\right)$$

for any two points x and y. Here, $t_0$ is a (data-dependent) constant representing the kernel
window size (set $t_0 = 2^{-s} \sigma_y^2$ for scale $s > 0$; select s representing some finer scale
of interest). The heat kernel matrix is then defined as

$$K = \{k_{ij}\} \quad \text{where} \quad k_{ij} = k(x_i, x_j)$$

for $i, j = 1, 2, \ldots, n$. The transition probability matrix is $P = D^{-1} K$, where D is the
diagonal matrix whose i-th entry is the sum of the i-th row of K.

Note: For large n, compute only $\beta \approx 25$ elements for each row of K using the randomized
approximate nearest neighbor algorithm [9]. This reduces the computational complexity from
$O(n^2 \cdot d)$ to $O(n \cdot \log(n) \cdot d)$ and captures local information.
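
The construction of K and P can be sketched in Python as below. This is a dense illustration that computes all pairs of entries; it does not include the approximate nearest neighbor truncation described in the note. The function name `transition_matrix`, the synthetic input, and the choice of `t0` are assumptions made for the example.

```python
import numpy as np

def transition_matrix(Y, t0):
    """Dense heat kernel matrix K, row sums d, and P = D^{-1} K (sketch of Step 2)."""
    Y = np.asarray(Y, dtype=float)
    sq = np.sum(Y ** 2, axis=1)
    # Pairwise squared distances ||y_i - y_j||^2, clipped against rounding error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T), 0.0)
    K = np.exp(-d2 / t0)          # heat kernel with window size t0
    d = K.sum(axis=1)             # diagonal entries of D
    P = K / d[:, None]            # each row of K divided by its row sum
    return K, d, P

# Synthetic points; t0 = 2^{-s} with s = 1, assuming unit variance after Step 1.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 3))
K, d, P = transition_matrix(Y, t0=2.0 ** -1)
```

Each row of `P` sums to 1, so `P` can be read as the transition matrix of a random walk on the data points.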
Note that P is not symmetric. There are various techniques to symmetrize P such that the
eigenvalues and eigenfunctions are still easy to compute. One way is to define

$$P' = D^{1/2} P D^{-1/2}$$

$P'$ is symmetric with the same eigenvalues as P. Also, the eigenvectors of P can easily be
obtained from those of $P'$ using a simple transformation by either $D^{1/2}$ or $D^{-1/2}$. The
RSVD algorithm may be used to compute the spectrum of $P'$.
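
The symmetrization can be checked numerically with a short sketch. Since $P = D^{-1} K$, the matrix $P'$ equals $D^{-1/2} K D^{-1/2}$, and if u is an eigenvector of $P'$ then $D^{-1/2} u$ is a right eigenvector of P with the same eigenvalue. The example below uses NumPy's dense symmetric eigensolver `numpy.linalg.eigh` in place of the RSVD, purely for illustration on a small synthetic input; the kernel window 0.5 is an arbitrary assumption.

```python
import numpy as np

# Small synthetic data set and its heat kernel matrix.
rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 3))
d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
K = np.exp(-d2 / 0.5)
d = K.sum(axis=1)                      # diagonal of D
P = K / d[:, None]                     # P = D^{-1} K (not symmetric)

# P' = D^{1/2} P D^{-1/2} = D^{-1/2} K D^{-1/2} (symmetric).
P_sym = K / np.sqrt(np.outer(d, d))

# Dense eigendecomposition (stand-in for the RSVD), sorted by decreasing eigenvalue.
lam, U = np.linalg.eigh(P_sym)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# Right eigenvectors of P: v_j = D^{-1/2} u_j.
V = U / np.sqrt(d)[:, None]
```

The leading eigenvalue is 1 and its right eigenvector is constant, which is the trivial pair excluded from the embedding.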

Step 3 (MHKC Embedding): Next, the heat kernel coordinates are defined for each of the
original data points. Let the eigenvalues of P be denoted $\lambda_j$ and the right eigenvectors
$v_j$ for $j = 1, 2, \ldots, r$, where $r = \mathrm{rank}(P)$.

Each point $x_i$ is then represented as

$$\mathrm{HKC}(x_i) = \left( e^{-\lambda_1 t} v_{1i},\; e^{-\lambda_2 t} v_{2i},\; \ldots,\; e^{-\lambda_r t} v_{ri} \right)$$

where $v_{ji}$ is the i-th coordinate of the eigenvector $v_j$. Here t is the time/scale parameter
that is varied to examine the geometry of the data set at various scales.

Note:  The  first  eigenvalue/eigenvector  of  P  is  trivial  and  should  not  be  used. 
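
A minimal sketch of the embedding formula follows. It assumes the eigenpairs are already sorted so that the trivial pair comes first, and that the eigenvectors are stored as the columns of an array. The function name `hkc_embedding` and the small hand-made inputs are hypothetical and serve only to show the indexing.

```python
import numpy as np

def hkc_embedding(lam, V, t, r):
    """Rows are HKC(x_i) built from the r leading non-trivial eigenpairs (sketch)."""
    lam = np.asarray(lam, dtype=float)
    V = np.asarray(V, dtype=float)
    keep = slice(1, r + 1)                       # skip the trivial first eigenpair
    # Column j of the result is exp(-lambda_j * t) * v_j.
    return np.exp(-lam[keep] * t) * V[:, keep]

# Hand-made eigenvalues and eigenvector columns (illustrative values only).
lam = np.array([1.0, 0.5, 0.25])
V = np.array([[1.0,  1.0,  2.0],
              [1.0, -1.0,  0.0],
              [1.0,  0.0, -2.0]])
E = hkc_embedding(lam, V, t=2.0, r=2)
```

Varying `t` rescales the columns relative to one another, which is how the time/scale parameter changes the geometry seen in the embedding.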

Next, we provide two approaches to automating the selection of heat-kernel coordinates in Step 3
of the algorithm described above.

4.1.1  Representation  Using  Canonical  Clustering 

Choose a small integer l representing the number of non-trivial eigenvectors $v_j$ to consider
(obtained via the algorithm in Section 4.1). Construct a binary tree using the first l non-trivial
eigenvectors to divide the n points into $N_c = 2^l$ clusters as follows. First, divide the points
into two sets defined by

$$\{x_i \mid y_i^k > \tfrac{1}{n} \sum_{j=1}^{n} y_j^k\} \quad \text{and} \quad \{x_i \mid y_i^k \le \tfrac{1}{n} \sum_{j=1}^{n} y_j^k\}$$

where $y_i^k = v_k(x_i)$, using the first diffusion vector ($k = 1$). Apply the 2nd diffusion vector
to both these sets; repeat recursively to obtain $N_c$ disjoint clusters. For each cluster
$c = 1, 2, \ldots, N_c$, find the point $x_{M(c)}$ closest to the cluster mean.

Select a suitable time t (the same for all the clusters) and use the heat kernel $k(x, y)$ to get
a vector of length $N_c$ given by

$$z_i = \left( k(x_i, x_{M(1)}),\; k(x_i, x_{M(2)}),\; \ldots,\; k(x_i, x_{M(N_c)}) \right)$$

for each $x_i$. Now, perform standard PCA on the dataset $\{z_i\}$. To visualize the dataset, simply
use projections on the first 2 or 3 dimensions.
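
The clustering and projection steps can be sketched as follows. Several details are assumptions made for illustration: each split uses the mean of the diffusion vector within the set being divided, empty clusters are simply skipped (so fewer than $2^l$ clusters may result), PCA is done through an SVD of the centered feature matrix, and the names `canonical_clusters` and `kernel_pca_view` are hypothetical. The diffusion vectors for the toy data are computed with a dense eigensolver rather than the RSVD.

```python
import numpy as np

def canonical_clusters(V, l):
    """Labels in [0, 2**l) from recursive mean splits on columns 0..l-1 of V."""
    labels = np.zeros(V.shape[0], dtype=int)
    for k in range(l):
        new = labels.copy()
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            above = V[idx, k] > V[idx, k].mean()   # split this set at its own mean
            new[idx] = 2 * c + above
        labels = new
    return labels

def kernel_pca_view(X, labels, t, dims=2):
    """Kernel features against cluster representatives, then PCA to `dims` columns."""
    reps = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        mean = X[idx].mean(axis=0)
        # Representative x_M(c): the cluster member closest to the cluster mean.
        reps.append(idx[np.argmin(np.sum((X[idx] - mean) ** 2, axis=1))])
    d2 = np.sum((X[:, None, :] - X[reps][None, :, :]) ** 2, axis=2)
    Z = np.exp(-d2 / t)                            # z_i = (k(x_i, x_M(c)))_c
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:dims].T, reps

# Toy data: two synthetic blobs; diffusion vectors from a dense eigensolver.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(2.0, 0.3, (20, 2))])
d2 = np.sum((X[:, None] - X[None]) ** 2, axis=2)
K = np.exp(-d2 / 1.0)
d = K.sum(axis=1)
lam, U = np.linalg.eigh(K / np.sqrt(np.outer(d, d)))
V = (U / np.sqrt(d)[:, None])[:, ::-1][:, 1:]      # descending order, trivial vector dropped
labels = canonical_clusters(V, l=2)
coords, reps = kernel_pca_view(X, labels, t=1.0, dims=2)
```

The array `coords` holds one 2D point per input point and can be passed directly to a scatter plot for visualization.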
4.1.2  Representation  Using  Multiscale  SVD

For each k, normalize $(y_1^k, y_2^k, \ldots, y_n^k)$ to have unit length. Define
$z_i^k = y_i^k / \lVert y^k \rVert$, where $\lVert y^k \rVert$ is the length of the vector
$(y_1^k, y_2^k, \ldots, y_n^k)$. Pick a small integer, say 2, as the number of diffusion vectors
to retain. Out of the top ten non-trivial diffusion eigenvectors, choose the two indices
$k_1, k_2$ such that the set (in 2D) consisting of all the points

$$\{(z_i^{k_1}, z_i^{k_2}) \mid i = 1, 2, \ldots, n\}$$

has the smallest average value of multiscale SVD at scales $2^{-s}$ for $s = 0, 1, 2, 3$ (at each
location and scale, compute the average squared distance to the best fitting line). Amongst the
various “best” choices, pick the one with the smallest value of $k_1 + k_2$.
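
One possible reading of this selection rule is sketched below. The assumptions are as follows: unit length is taken as the Euclidean norm, a "location" is a ball of radius $2^{-s}$ centered at each data point, and the average squared distance to the best fitting line is computed as the smaller eigenvalue of the local 2D covariance matrix. Ties are resolved by rounding scores and preferring the smaller index sum. The names `msvd_score` and `select_pair` are hypothetical, indices are 0-based, and the input is a toy matrix rather than real diffusion vectors.

```python
import numpy as np
from itertools import combinations

def msvd_score(pts, scales=(0, 1, 2, 3)):
    """Mean over centers and scales of the mean squared distance to the local best-fit line."""
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=2)
    vals = []
    for s in scales:
        r2 = (2.0 ** -s) ** 2
        for i in range(len(pts)):
            nb = pts[d2[i] <= r2]                  # neighbors within radius 2^{-s}
            if len(nb) < 3:
                continue                           # too few points to fit a line
            cov = np.cov(nb.T, bias=True)          # local 2x2 covariance
            vals.append(np.linalg.eigvalsh(cov)[0])  # smaller eigenvalue = line residual
    return float(np.mean(vals)) if vals else float("inf")

def select_pair(Y, top=10):
    """Pick (k1, k2) among the first `top` columns of Y with the smallest score."""
    Z = Y / np.linalg.norm(Y, axis=0)              # each column scaled to unit length
    top = min(top, Z.shape[1])
    scored = [(round(msvd_score(Z[:, [a, b]]), 12), a + b, a, b)
              for a, b in combinations(range(top), 2)]
    score, _, k1, k2 = min(scored)                 # ties broken by smaller k1 + k2
    return k1, k2, score

# Toy columns: two collinear coordinates and one noise coordinate.
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 50)
Y = np.column_stack([t, 2.0 * t, rng.normal(size=50)])
k1, k2, score = select_pair(Y)
```

In this toy case the first two columns trace a straight line in 2D, so that pair has a score of essentially zero and is selected. The exhaustive loop over pairs also shows where the cost of this approach grows as more vectors or larger combinations are considered.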

4.2  Deliverables  /  Milestones 


Date     | Deliverables / Milestones | Status
Oct 2010 | Progress report for period 1, 1st quarter | ✓
Jan 2011 | Progress report for period 1, 2nd quarter / complete randomized matrix decompositions task | ✓
Apr 2011 | Progress report for period 1, 3rd quarter / complete approximate nearest neighbors task |
Jul 2011 | Progress report for period 1, 4th quarter / complete experiments - part 1 | ✓
Oct 2011 | Progress report for period 2, 1st quarter | ✓
Jan 2012 | Progress report for period 2, 2nd quarter / complete multiscale SVD task | ✓
Apr 2012 | Progress report for period 2, 3rd quarter | ✓
Jul 2012 | Progress report for period 2, 4th quarter / complete experiments - part 2 | ✓
Oct 2012 | Progress report for period 3, 1st quarter | ✓
Jan 2013 | Progress report for period 3, 2nd quarter / complete multiscale Heat Kernel task |
Apr 2013 | Progress report for period 3, 3rd quarter |
Jul 2013 | Final project report + software + documentation on CDROM / complete experiments - part 3 |
5  Results  and  Discussion


We described two approaches to automating the selection of the “appropriate” diffusion
vectors for a given dataset. The first approach in itself provides an agnostic and canonical way of
“clustering”. In terms of computational cost, the first approach is much cheaper, as it avoids the
potential combinatorial explosion of the second approach, which must score every candidate
combination of diffusion vectors. However, the second approach directly evaluates the
information content of each diffusion vector in a multiscale sense and picks the “best”
combination. We will experimentally evaluate both techniques against real-world datasets.
6  Conclusions


The  project  is  on  track  with  design/implementation  of  the  new  multiscale  heat  kernel  coordinates 
algorithms.  We  will  continue  with  algorithmic  improvements  and  experimentation  using  the 
developed  algorithms  in  the  next  quarter. 

No  problems  are  currently  anticipated. 